ABSTRACT

An analysis of the skills necessary for performance 
on the Test of English as a Foreign Language (TOEFL) tends to support 
the view that there are important, although subtle, secondary 
dimensions present in the test. This research explored the 
feasibility of an item response theory (IRT) based method of modeling 
examinee performance on these secondary ability dimensions. Both
exploratory multidimensional IRT (MIRT) and confirmatory 
multidimensional IRT (CMIRT) models were investigated in the study. 
The work performed included the application of unidimensional IRT, 
MIRT, and CMIRT models in two TOEFL forms to evaluate the extent to 
which model fit is enhanced by using a multidimensional model and to 
determine to what extent the additional fitted ability dimensions 
correspond to meaningful cognitive processes or content areas. 
Results indicate that the MIRT and CMIRT procedures were successful 
in modeling secondary ability dimensions on TOEFL and that they 
provide corroborative evidence in interpreting the structure of the 
test that is consistent with previous structure interpretations. The 
data also illustrate how the consistent Akaike information criterion 
can identify the best competing models of test structure. Four 
figures (plots) and seven tables illustrate the discussion. (Contains
34 references.) (SLD)
The Test of English as a Foreign Language (TOEFL) was developed in 1963 by a National Council
on the Testing of English as a Foreign Language, which was formed through the cooperative effort of
more than thirty organizations, public and private, that were concerned with testing the English
proficiency of nonnative speakers of the language applying for admission to institutions in the United
States. In 1965, Educational Testing Service (ETS) and the College Board assumed joint responsibility
for the program, and in 1973, a cooperative arrangement for the operation of the program was
entered into by ETS, the College Board, and the Graduate Record Examinations (GRE) Board. The
membership of the College Board is composed of schools, colleges, school systems, and educational 
associations; GRE Board members are associated with graduate education. 

ETS administers the TOEFL program under the general direction of a Policy Council that was 
established by, and is affiliated with, the sponsoring organizations. Members of the Policy Council
represent the College Board and the GRE Board and such institutions and agencies as graduate schools
of business, junior and community colleges, nonprofit educational exchange agencies, and agencies
of the United States government.
A continuing program of research related to the TOEFL test is carried out under the direction of the
TOEFL Research Committee. Its six members include representatives of the Policy Council, the
TOEFL Committee of Examiners, and distinguished English as a second language specialists from the
academic community. Currently the Committee meets twice yearly to review and approve proposals
for test-related research and to set guidelines for the entire scope of the TOEFL research program. 
Members of the Research Committee serve three-year terms at the invitation of the Policy Council; 
the chair of the committee serves on the Policy Council. 

Because the studies are specific to the test and the testing program, most of the actual research is 
conducted by ETS staff rather than by outside researchers. However, many projects require the 
cooperation of other institutions, particularly those with programs in the teaching of English as a 
foreign or second language. Representatives of such programs who are interested in participating in 
or conducting TOEFL-related research are invited to contact the TOEFL program office. All TOEFL 
research projects must undergo appropriate ETS review to ascertain that the confidentiality of data will 
be protected. 

Current (1991-92) members of the TOEFL Research Committee are: 

James Dean Brown University of Hawaii 

Patricia Dunkel (Chair) Pennsylvania State University 

William Grabe Northern Arizona University 

Kyle Perkins Southern Illinois University at Carbondale 

Elizabeth C. Traugott Stanford University 

John Upshur Concordia University 
ABSTRACT



An analysis of the skills necessary for performance on the TOEFL test tends to support the view that there are
important, although perhaps subtle, secondary dimensions present in the test. Given that these subtle secondary
ability dimensions may be present in examinee response data, and that they do represent meaningful
psychological variables, the purpose of this research was to explore the feasibility of an IRT-based method of
modeling examinee performance on these secondary ability dimensions. The procedure investigated is based on
a multidimensional extension of the IRT model currently used for equating the TOEFL test. Both exploratory
multidimensional IRT (MIRT) and confirmatory multidimensional IRT (CMIRT) models were investigated in
the study. The work performed included the application of unidimensional IRT, MIRT, and CMIRT models
to two TOEFL forms to evaluate the extent to which model fit is enhanced by using a multidimensional model
and to determine to what extent the additional fitted ability dimensions correspond to meaningful cognitive 
processes or content areas. 

The results of this study indicate that the MIRT and CMIRT procedures were successful in modeling secondary 
ability dimensions on TOEFL. The two procedures provided corroborative evidence in interpreting the structure
of the test that was consistent with previous interpretations of the test's structure. The data presented in this
study also provide an illustration of how a particular criterion for assessing model fit, the consistent Akaike
information criterion, can be utilized to identify the best of several competing models of test structure.
INTRODUCTION



Factor analytic research on the Test of English as a Foreign Language (TOEFL) seems to lead to the conclusion 
that the test measures primarily one factor. For example, in a factor analytic study of seven different language 
groups, Swinton and Powers (1980) obtained first-to-second eigenvalue ratios ranging from 7.1 to 9.3, depending
on the language group. When the language groups were combined, a ratio of 8.9 was obtained. Similar results 
were obtained by Hale, Stansfield, Rock, Hicks, Butler, and Oller (1988) using both factor analytic and item
response theory (IRT) based methods for assessing dimensionality, by Dunbar (1982) and Hale, Rock, and Jirele
(1989), who used confirmatory factor analysis approaches, and by Boldt (1988) using latent structure analysis.

However, an analysis of the skills necessary for performance on the TOEFL test tends to support the view that 
there are important, although perhaps subtle, secondary dimensions present in the test (Duran, Canale, Penfield,
Stansfield, & Liskin-Gasparro, 1985). Note, for instance, that the test is constructed to have five content 
components: listening comprehension, structure, written expression, vocabulary, and reading comprehension. 
Each section includes items measuring a variety of content subareas. For example, the listening comprehension
section includes items requiring knowledge of syntax, lexical items, and items focusing on phonology, stress, and 
intonation. Moreover, there are a variety of required skills in common to items, not only within sections but also 
across sections (Duran et al., 1985). Clearly, then, the TOEFL test includes items measuring a variety of content
areas and cognitive processes. Consequently, it would seem reasonable to expect to find at least some empirical 
evidence of these dimensions in examinee response data. To some extent, in fact, this expectation is borne out 
by the research cited above. 

For instance, Swinton and Powers (1980) concluded that there was evidence of three factors underlying examinee 
performance on the TOEFL test, although the interpretation of the factors tended to vary with native language 
and ability level. Hale et al. (1988) concluded that there was evidence of two factors, one related to listening 
comprehension, and one related to the remainder of the test. A later study by Hale, Rock, and Jirele (1989)
suggested a consistent two-factor structure of the TOEFL test across several language groups. Dunbar (1982) 
found evidence of four factors: one general factor and one secondary factor associated with each of the three 
TOEFL sections. Oltman, Stricker, and Barrows (1988) used a three-way multidimensional scaling approach to
examine the effects of native language and English proficiency on the structure of the TOEFL test and found
evidence of three dimensions that corresponded to the sections of the examination, and a fourth that was
identified as an "end-of-test phenomenon." 

Given that these subtle secondary ability dimensions may be present in examinee response data, and that they 
do represent meaningful psychological variables, it would be useful to have an IRT-based procedure that could 
(1) confirm that such dimensions are or are not present in a particular form of the test and (2) extract
information about these abilities, when present, for individual examinees. Such a procedure could yield valuable
dividends. For instance, it might be possible to use such a procedure to provide meaningful feedback to 
examinees regarding their performance on specific content areas and item types. In fact, the ability to provide 
useful diagnostics might be enhanced by providing feedback from the procedure to test developers, who could 
use the information as a guide to test construction. 

The purpose of this research was to explore the feasibility of an IRT-based method of modeling examinee
performance on these secondary dimensions. The procedure investigated is based on a multidimensional 
extension of the IRT model currently used for equating the TOEFL test. The work performed included the 
application of both unidimensional and multidimensional IRT models to two forms of the TOEFL test to 
evaluate the extent to which model fit is enhanced by using various multidimensional models, and to determine 
to what extent the additional fitted ability dimensions correspond to test content. 
Item Response Theory
Unidimensional Item Response Theory

In recent years, item response theory has become a very popular tool both in research and in practical 
measurement applications. The attractiveness of IRT derives primarily from its parameter invariance properties 
and the availability of well-defined standard errors of estimate (Bock & Aitkin, 1981; Lord, 1980; Lord & Novick,
1968). Moreover, the ability estimate standard errors are expressed as functions of ability. Thus, the standard 
error of estimate is reported for each level of ability rather than for a test as a whole, as is the case with more 
traditional measurement procedures. As a consequence, not only does the use of IRT improve the quality of 
measurement, but it makes possible applications that would be prohibitively difficult or impossible with more 
traditional measurement procedures. 

Among the item response theory models that have been developed are the two-parameter normal ogive model
(Lord, 1952); the two-parameter logistic model (Birnbaum, 1958); the one-parameter logistic, or Rasch, model
(Rasch, 1960); and the three-parameter logistic model (Birnbaum, 1968). By far the most widely used of these
models are the one-parameter logistic (1PL) model and the three-parameter logistic (3PL) model. The model
used for equating the TOEFL is the 3PL model, which was used in this research. 

The 3PL model is given by 

$$P_i(\theta_j) = c_i + \frac{1 - c_i}{1 + \exp[-D a_i (\theta_j - b_i)]}, \tag{1}$$

where $c_i$ is the pseudo-guessing parameter for item i; $a_i$ is the item discrimination parameter for item i; $b_i$ is the
item difficulty parameter for item i; and $\theta_j$ is the ability parameter for examinee j. The D in the exponent is
equal to 1.702, and is included to make the logistic curve more closely approximate what would be obtained using
a normal ogive model.
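
As a minimal sketch of Equation 1, the function below evaluates the 3PL response probability for a single item. The function name and the item parameter values used in the loop are hypothetical and chosen only for illustration; they are not taken from the TOEFL calibration.

```python
import math

D = 1.702  # scaling constant from Equation 1


def p_3pl(theta, a, b, c):
    """Probability of a correct response under the 3PL model (Equation 1)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: discrimination 1.0, difficulty 0.0, pseudo-guessing 0.2.
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_3pl(theta, a=1.0, b=0.0, c=0.2), 3))
```

When ability equals the item difficulty, the probability is halfway between c and 1, and for very low ability it approaches c rather than zero.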

Although item response theory has proven to be a very powerful and useful measurement tool, use of IRT 
models has been somewhat limited because the available models require the assumption that the test being 
analyzed measures only a single ability dimension. This unidimensionality assumption often limits the application
of IRT-based methods to tests consisting of relatively homogeneous sets of items, such as might be found on a 
vocabulary test. Tests that include items sampled from several content areas, such as a science test containing 
both physics and chemistry items, are probably not sufficiently homogeneous to permit analysis using IRT. Such 
may also be the case with tests containing multifaceted items, such as a language test containing English structure
items requiring a high level of reading comprehension or vocabulary skill. 

In recent years, attempts have been made to extend IRT to the case of multidimensional tests. In 
multidimensional IRT, or MIRT, examinee responses are modeled as a function of a set of examinee traits, and 
the assumption of unidimensionality is replaced by the less restrictive requirement that the dimensionality of the 
item responses matches the dimensionality of the set of examinee traits used in the MIRT model. 

Multidimensional IRT 

As with factor analysis, there are two basic approaches to MIRT analysis: exploratory and confirmatory. In
exploratory procedures, the emphasis is on discovering the best fitting model, while in confirmatory approaches
the focus is on evaluating the extent to which the data follow a hypothetical model developed a priori on the
basis of content and process analysis of the instrument to be analyzed. Both approaches were examined in this 
research. 
Exploratory Models



Most of the recent progress in MIRT research has occurred in two areas: the development of a multidimensional
two-parameter normal ogive nonlinear factor analysis model (Bock, Gibbons, & Muraki, 1985), and the
development of multidimensional two- and three-parameter logistic IRT models (McKinley, 1983, 1987). Work
on these procedures is still at an early stage, but it has progressed to the point that estimation procedures are
available. For the nonlinear factor analysis procedure, the TESTFACT program (Wilson, Wood, & Gibbons,
1984) is available. For the two-parameter logistic IRT model, the MAXLOG program (McKinley & Reckase,
1983) is available, and for the three-parameter model, the MULTIDIM program (McKinley, 1987) is available.

For this research, the MIRT model employed was the multidimensional three-parameter logistic (M3PL) model.
This model was selected for two reasons: microcomputer-based estimation procedures are available for use with
the M3PL model, and it is closely related to the model currently used with the TOEFL test. 

The M3PL model is given by 

$$P_i(\boldsymbol{\theta}_j) = c_i + \frac{1 - c_i}{1 + \exp[-1.702(b_i + \mathbf{a}_i' \boldsymbol{\theta}_j)]}, \tag{2}$$

where $P_i(\boldsymbol{\theta}_j)$ is the probability of a correct response to item i by examinee j; $\boldsymbol{\theta}_j$ is the ability parameter vector
of examinee j; $\mathbf{a}_i$ is the discrimination parameter vector for item i; $b_i$ is the threshold parameter for item i; and
$c_i$ is the lower asymptote parameter for item i. The ability and discrimination parameter vectors contain one
element for each dimension. 
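
The following sketch evaluates Equation 2 for one item and one examinee, and checks that with a single dimension it reduces to the 3PL model of Equation 1 when the threshold is set to the negative product of discrimination and difficulty. The function name and all numeric values are hypothetical illustrations, not estimates from this study.

```python
import math


def p_m3pl(theta, a, b, c):
    """M3PL probability (Equation 2).

    theta and a are equal-length sequences (one element per dimension);
    b is the scalar threshold and c the scalar lower asymptote.
    """
    if len(theta) != len(a):
        raise ValueError("theta and a must have one element per dimension")
    z = b + sum(a_k * t_k for a_k, t_k in zip(a, theta))
    return c + (1.0 - c) / (1.0 + math.exp(-1.702 * z))


# Two-dimensional example with hypothetical parameters.
print(round(p_m3pl([0.5, -1.0], [1.0, 0.4], b=0.2, c=0.2), 3))

# One-dimensional check: a threshold of -a * b_3pl reproduces Equation 1.
a_1d, b_3pl, theta_1d = 1.2, 0.5, 0.3
direct = 0.2 + 0.8 / (1.0 + math.exp(-1.702 * a_1d * (theta_1d - b_3pl)))
print(abs(p_m3pl([theta_1d], [a_1d], -a_1d * b_3pl, 0.2) - direct) < 1e-12)
```

Note the sign convention: in Equation 2 the threshold enters additively, so a larger threshold makes an item easier, the opposite of the difficulty parameter in Equation 1.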



Uses of the M3PL model thus far have been somewhat limited, primarily due to the recency of the development 
of estimation procedures for applying the model. The model was, however, applied to a French proficiency exam 
by Kaya-Carton (1988). In this application, parameters of the M2PL model were obtained using the MULTIDIM 
program (item lower asymptote parameters were fixed at zero). The method was compared to maximum
likelihood factor analysis and Boolean factor analysis. Despite the shortness of the test (18 items) and the small
sample size (between 700 and 800 examinees), the results obtained were positive. The MIRT solution was found 
to be interpretable and consistent with the factor analysis solutions. 

Confirmatory Models 

As mentioned above, the goal of confirmatory MIRT, or CMIRT, analysis is to confirm or disconfirm the 
presence of some hypothesized test structure. In CMIRT, competing hypothesized models are applied to data 
and compared using some measure of model fit. The CMIRT procedure used in this research was based on a
modification of the M3PL model. 

Adaptation of the M3PL model for use in confirmatory analysis consists of imposing a set of constraints on the 
item discrimination parameters in accordance with an a priori target test structure. As an example, consider a 
six-item English test containing three vocabulary items and three reading comprehension items. One possible
structure for such a test might be given by 



$$S = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 1 \\ 1 & 1 & 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & 1 & 1 \end{bmatrix}, \tag{3}$$



where rows are dimensions and columns are items. A 1 indicates that an item discrimination parameter will be 
estimated for the item on that dimension, and a 0 indicates that the item discrimination parameter on that 
dimension will be constrained to 0.0. Thus, in this example, the first dimension corresponds to a general latent 
trait, the second dimension represents a trait specific to the first three items, and the third dimension represents
a trait specific to the last three items. (For computational details about CMIRT procedures, see McKinley, 1968,
1989.) 
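
The six-item example can be sketched as a mask applied to a matrix of discrimination parameters. This is only an illustration of the constraint pattern in Equation 3: the helper names and the starting value of 0.8 are hypothetical, and the sketch does not reproduce the estimation performed by the CMIRT software.

```python
# Target structure from Equation 3: rows are dimensions, columns are items.
S = [
    [1, 1, 1, 1, 1, 1],  # general dimension, all six items
    [1, 1, 1, 0, 0, 0],  # dimension specific to the first three items
    [0, 0, 0, 1, 1, 1],  # dimension specific to the last three items
]


def apply_structure(a, structure):
    """Hold at 0.0 every discrimination whose entry in the structure is 0."""
    return [
        [a_ki if s_ki == 1 else 0.0 for a_ki, s_ki in zip(a_row, s_row)]
        for a_row, s_row in zip(a, structure)
    ]


def n_constrained(structure):
    """Number of discrimination parameters constrained to 0.0."""
    return sum(row.count(0) for row in structure)


# Hypothetical unconstrained discriminations (dimensions x items).
a_free = [[0.8] * 6 for _ in range(3)]
a_constrained = apply_structure(a_free, S)
for row in a_constrained:
    print(row)
print("constrained:", n_constrained(S), "estimated:", 18 - n_constrained(S))
```

Of the 18 possible discrimination parameters in this example, 6 are fixed at zero and 12 remain to be estimated.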

Like the M3PL MIRT model, CMIRT models have not been widely used due to the recency of their
development. In one application reported by Kingston and McKinley (1988), CMIRT procedures were applied
to the Graduate Record Examination's General Test. Results, which were consistent with previous research on
that test, indicated the presence of a general dimension, a verbal dimension, a quantitative dimension, and a weak
analytical dimension defined by logical reasoning items (but not analytical reasoning items). 

METHOD

Data

Data for this study comprised the responses to the 146 operational TOEFL items for the November 1987 and 
May 1988 administrations. Responses were sampled for 2,500 randomly selected examinees from domestic test
centers and for 2,500 randomly selected examinees from foreign test centers. Sampling was performed by
selecting all candidates with low and high number-correct scores and equal numbers of candidates at each score
level in between.* Foreign and domestic examinees were analyzed separately, as were the two forms, as a means
of cross-validating the results. 

Estimation 

For all analyses in this study, solutions were based on the M3PL model, except that the c-parameter was not 
estimated. Rather, it was held fixed at a value of 0.2. The parameters of the model were estimated using an 
EM algorithm based procedure similar to those described by Bock and Aitkin (1981), Mislevy and Bock (1985), 
and Bock, Gibbons, and Muraki (1985). The algorithm has been implemented in the MULTIDIM program 
(McKinley, 1987), which is designed for exploratory MIRT analysis, and in the CONFIRM program (McKinley, 
1989), which is designed for both exploratory and confirmatory analyses. 

In this algorithm, item parameter estimation is performed using a two-step marginal maximum likelihood
procedure. Multiple latent abilities are hypothesized, each is treated as a random variable, and integration over
the joint distributions of these random variables is performed. The integration over the ability distribution,
accomplished through numerical quadrature, is performed during the first step, the E (expectation) step, and 
produces an expected sample size and number-correct score at each quadrature node for each item. These 
values are used in the second step, the M (maximization) step, to perform marginal maximum likelihood item 
parameter estimation. 
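
The E step described above can be sketched for a single ability dimension as follows. This is a simplified illustration under stated assumptions, not the MULTIDIM or CONFIRM implementation: it uses one dimension instead of a multidimensional grid, five hypothetical quadrature nodes with normalized normal-density weights, three hypothetical items, and the lower asymptote fixed at 0.2 as in this study. All function and variable names are illustrative.

```python
import math


def p_correct(theta, a, b, c=0.2):
    """One-dimensional case of Equation 2 with the lower asymptote fixed."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.702 * (b + a * theta)))


def e_step(responses, items, nodes, weights):
    """Return (n_bar, r_bar) for the current item parameters.

    n_bar[k]    expected number of examinees at quadrature node k
    r_bar[i][k] expected number of correct responses to item i at node k
    """
    n_items, q = len(items), len(nodes)
    # Response probabilities at each node depend only on the item.
    p = [[p_correct(x, a, b) for x in nodes] for a, b in items]
    n_bar = [0.0] * q
    r_bar = [[0.0] * q for _ in range(n_items)]
    for u in responses:
        # Likelihood of this response string at each node, times the weight.
        joint = []
        for k in range(q):
            like = 1.0
            for i in range(n_items):
                like *= p[i][k] if u[i] == 1 else 1.0 - p[i][k]
            joint.append(like * weights[k])
        marginal = sum(joint)  # marginal likelihood of the response string
        for k in range(q):
            post = joint[k] / marginal  # posterior weight of node k
            n_bar[k] += post
            for i in range(n_items):
                if u[i] == 1:
                    r_bar[i][k] += post
    return n_bar, r_bar


nodes = [-2.0, -1.0, 0.0, 1.0, 2.0]
raw = [math.exp(-0.5 * x * x) for x in nodes]
weights = [w / sum(raw) for w in raw]
items = [(1.0, 0.5), (0.8, 0.0), (1.2, -0.5)]  # hypothetical (a, b) pairs
responses = [[1, 1, 0], [0, 0, 0], [1, 1, 1], [1, 0, 0]]
n_bar, r_bar = e_step(responses, items, nodes, weights)
print([round(x, 3) for x in n_bar])
```

The expected sample sizes sum to the number of examinees, and each item's expected number-correct scores sum to its observed number correct. The M step would then maximize the likelihood of these expected counts with respect to the item parameters.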

Model Evaluation 

In MIRT, evaluation of model-data fit is relatively straightforward. Solutions for simple models (such as the 
unidimensional 3PL model) are obtained first. Models of increasing complexity are then created by adding 
parameters. These more complex models subsume simpler models, making it possible to test the significance 
of the contribution of the additional parameters using a chi-square procedure such as is implemented in 
TESTFACT (Wilson, Wood, & Gibbons, 1984). 



*This sampling procedure was used to increase the sensitivity of the analyses, particularly at lower levels of
ability. Although this procedure may have been inconsistent with the assumptions of the marginal maximum
likelihood procedure, the effects of the sampling procedure on the results of the model estimations are difficult 
to predict, as the relationship that exists between observed score distributions and latent ability distributions is 
complex. It is recognized that random examinee samples may have yielded different results. 


For example, assume that one- and two-dimensional MIRT solutions have been obtained on the same data using 
a multidimensional 2PL model. Comparing the solutions can be accomplished by computing, for each solution, 
a measure of fit such as the likelihood ratio chi-square statistic (Bock, Gibbons, & Muraki, 1985). This statistic 
is given by

$$G^2 = 2 \sum_{j=1}^{J} r_j \ln\left(\frac{r_j}{N P_j}\right), \tag{4}$$

where J is the number of possible unique response strings for the item set to be calibrated, $r_j$ is the number of
examinees with response string j, N is the total number of examinees in the calibration sample, and $P_j$ is
computed as

$$P_j = \sum_{k=1}^{q} L_j(\boldsymbol{\theta}_k) W_k, \tag{5}$$

where $P_j$ represents the marginal likelihood of observing response string j, $L_j(\boldsymbol{\theta}_k)$ is the likelihood of observing
response string j given an ability vector equal to $\boldsymbol{\theta}_k$, and $\boldsymbol{\theta}_k$ and $W_k$ are the quadrature nodes and weights used
for numerically integrating over the ability distribution. The degrees of freedom for the statistic given by
Equation 4 are given by

$$df = 2^n - n(m+2), \tag{6}$$

where n is the number of items and m is the number of dimensions. If the c-parameter is not estimated, then 
the second term on the right is n(m + 1). For CMIRT models it is necessary to reduce Equation 6 by the number 
of parameters constrained to 0.0. 

While it is doubtful that the statistic given by Equation 4 is actually distributed as a chi-square, the difference 
between the values of $G^2$ for subsuming models has been shown to be asymptotically distributed as chi-square
(Haberman, 1977). The degrees of freedom for the difference between two values of $G^2$ is equal to the
difference between Equation 6 for the two solutions. For IRT models, this equals the difference in the number
of item parameters estimated. 
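
A minimal sketch of these quantities is given below, assuming the usual likelihood ratio form of Equation 4 in which each observed response string contributes its count times a log ratio. The function names are hypothetical, and the small count vectors are invented purely to exercise the code.

```python
import math


def g2(counts, probs):
    """Likelihood ratio chi-square (Equation 4).

    counts[j] is r_j and probs[j] is the marginal probability P_j of
    response string j. Strings that were never observed contribute zero.
    """
    n = sum(counts)
    return 2.0 * sum(
        r * math.log(r / (n * p)) for r, p in zip(counts, probs) if r > 0
    )


def df_full(n_items, n_dims, estimate_c=True):
    """Degrees of freedom from Equation 6 for an unconstrained model."""
    per_item = n_dims + 2 if estimate_c else n_dims + 1
    return 2 ** n_items - n_items * per_item


def df_difference(n_items, dims_small, dims_large, estimate_c=False):
    """Degrees of freedom for the difference in G^2 between nested models."""
    return (df_full(n_items, dims_small, estimate_c)
            - df_full(n_items, dims_large, estimate_c))


print(g2([2, 2], [0.5, 0.5]))           # perfect fit gives 0.0
print(round(g2([3, 1], [0.5, 0.5]), 3))  # misfit gives a positive value
print(df_full(6, 3))                     # six items, three dimensions
print(df_difference(146, 1, 2))          # one added dimension, 146 items
```

For the 146-item test with the c-parameter held fixed, moving from one to two unconstrained dimensions adds one discrimination parameter per item, so the difference test has 146 degrees of freedom.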

Another way in which two competing models of test structure can be compared is based on the work of Akaike 
(1973, 1987). This approach is based on a criterion called the entropic information criterion (Bozdogan, 1987), 
also known as the AIC, and involves evaluating model fit in terms of the natural logarithm of the likelihood of 
the solution, which is presumed to be an approximation of the expected log likelihood of the true model. The 
greater the likelihood of the solution (in practice, the lower the negative log likelihood), the closer the fitted 
model is presumed to approximate the true model. This approach is particularly useful in the context of CMIRT 
analysis, since competing models often are not subsuming and therefore cannot be compared using the chi-square 
procedure described above. 

The AIC statistic is given by

$$\mathrm{AIC} = -2\log(L) + 2k, \tag{7}$$

where log(L) denotes the natural log of the likelihood and k is the number of parameters estimated. The 2k 
term constitutes a sort of penalty function that penalizes overparameterization. 
A variation on the AIC, called the consistent AIC (CAIC), was proposed by Bozdogan (1987). This statistic was
derived in response to criticism that the AIC statistic does not provide an asymptotically consistent estimate of 
model order (Bozdogan, 1987). 



The CAIC statistic is given by



$$\mathrm{CAIC} = -2\log(L) + k(\log(N) + 1), \tag{8}$$



where N is the sample size. This modification of the AIC has the effect of increasing the penalty for
overparameterization and, consequently, tends to lead to the selection of simpler models. All three statistics 
were used in this research. 
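
To make the difference in penalties concrete, the sketch below implements Equations 7 and 8 and computes how much the value of -2 log(L) must drop before an added dimension is preferred under each criterion. It assumes a sample of 2,500 examinees and 146 additional parameters (one discrimination per item); the function names are illustrative.

```python
import math


def aic(log_l, k):
    """Equation 7."""
    return -2.0 * log_l + 2.0 * k


def caic(log_l, k, n):
    """Equation 8."""
    return -2.0 * log_l + k * (math.log(n) + 1.0)


n_examinees = 2500   # approximate size of each sample in this study
extra_params = 146   # one more discrimination parameter per item

# Minimum decrease in -2 log(L) needed for the larger model to be chosen.
aic_threshold = 2.0 * extra_params
caic_threshold = extra_params * (math.log(n_examinees) + 1.0)
print(aic_threshold, round(caic_threshold, 1))
```

Under these assumptions the CAIC requires a decrease in -2 log(L) more than four times as large as the AIC requires before an additional dimension is selected.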



The first analysis performed in this study was to fit a unidimensional IRT model to both the foreign test center 
data and the data from the domestic test centers, and to evaluate the goodness-of-fit of the model for both sets
of data using the chi-square, AIC, and CAIC statistics. Following this, residuals were computed and analyzed,
using a procedure described by Divgi (1980), to determine whether there appeared to be any interpretable
common variance remaining after the model was fit to the data. Residuals were computed as 



$$P_{ij} - u_{ij}, \tag{9}$$

where $P_{ij}$ is the probability of the observed response to item i by examinee j predicted from the IRT model, and
$u_{ij}$ is the observed response. Residual correlation matrices were then analyzed using principal components
analysis. The advantage of this procedure is that residuals are on a continuous scale, which reduces the potential 
problems that may occur when principal components are applied to interitem phi or tetrachoric correlations. 
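
The residual analysis can be sketched as follows. This is an illustration rather than Divgi's (1980) procedure itself: the model probabilities and responses are hypothetical, the probabilities are taken to be model probabilities of a correct response, and the leading eigenvalue of the residual correlation matrix is approximated by power iteration instead of a full principal components routine. All names are illustrative.

```python
import math


def residuals(p, u):
    """Equation 9 applied to examinee-by-item matrices p and u."""
    return [[p_ij - u_ij for p_ij, u_ij in zip(p_row, u_row)]
            for p_row, u_row in zip(p, u)]


def correlation_matrix(x):
    """Pearson correlations between the columns of x (no constant columns)."""
    n, m = len(x), len(x[0])
    means = [sum(row[i] for row in x) / n for i in range(m)]
    dev = [[row[i] - means[i] for i in range(m)] for row in x]
    ss = [sum(d[i] ** 2 for d in dev) for i in range(m)]
    return [[sum(d[i] * d[j] for d in dev) / math.sqrt(ss[i] * ss[j])
             for j in range(m)] for i in range(m)]


def largest_eigenvalue(mat, iters=500):
    """Dominant eigenvalue of a symmetric non-negative definite matrix.

    Power iteration from a non-uniform start vector; it assumes the start
    vector is not orthogonal to the dominant eigenvector.
    """
    v = [1.0 / (i + 1) for i in range(len(mat))]
    lam = 0.0
    for _ in range(iters):
        w = [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in mat]
        lam = math.sqrt(sum(w_i * w_i for w_i in w))
        v = [w_i / lam for w_i in w]
    return lam


# Hypothetical model probabilities and observed responses (4 x 3).
p = [[0.9, 0.8, 0.7], [0.6, 0.5, 0.4], [0.3, 0.4, 0.2], [0.8, 0.6, 0.5]]
u = [[1, 1, 0], [1, 0, 0], [0, 0, 1], [1, 1, 1]]
r = correlation_matrix(residuals(p, u))
print(round(largest_eigenvalue(r), 3))
```

A leading eigenvalue well above 1 in the residual correlation matrix would suggest common variance that the fitted model has not accounted for.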

After this, two-, three-, and four-dimensional MIRT models were fit to both datasets, and the goodness-of-fit 
evaluated. Item discrimination vectors were examined to determine whether the fitted ability dimensions
corresponded to content areas or cognitive processes. Residual analyses were also performed using Divgi's
(1980) procedure. 

Finally, the CMIRT procedure was used to impose several hypothesized test structures on the two sets of data. 
CMIRT solutions were obtained for one 2-dimensional structure, three 3-dimensional structures, and two 
4-dimensional structures. These target structures are summarized in Table 1, which indicates the dimensions
for which discrimination parameters were estimated in a given CMIRT solution. The abbreviations for the test 
structures used in Table 1 will be used throughout the report. For example, 3DC3 will refer to the 
three-dimensional solution (3D), configuration 3 (C3). 
TABLE 1

Summary of CMIRT Target Test Structures

                                          Solution/Structure
                            No.     2 Dim.  3 Dim.                  4 Dim.
Section/Part                Items   2DC1    3DC1    3DC2    3DC3    4DC1    4DC2

Listening Comp.             50
  Statements                20      1,2     1,2     1,2     1,2     1,2     1,2
  Dialogues                 15      1,2     1,2     1,2     1,2     1,2     1,2
  Mini talks                15      1,2     1,2     1,2     1,2     1,2     1,2
Structure & Written Exp.    38
  Structure                 14      1       1,3     1,3     1,3     1,3     1,3
  Written Exp.              24      1       1,3     1,3     1,3     1,3     1,3
Vocabulary & Reading Comp.  58
  Vocabulary                29      1       1,3     1,3     1,3     1,4     1,3
  Reading Comp.             29      1       1,3     1,2     1       1,4     1,4


Of the above structures, the 2DC1, 3DC1, and 4DC1 solutions are most consistent with TOEFL structures that 
have been suggested by previous research. In each of these solutions, a discrimination parameter is estimated 
for each item on the first dimension, which can be thought of as a general factor. In Solution 2DC1, the second
dimension is constrained so that discrimination parameters are estimated only for listening comprehension items. 
In Solution 3DC1, the second dimension is constrained as in Solution 2DC1, while discrimination parameters 
on the third dimension are estimated for structure and written expression and vocabulary and reading 
comprehension items. In Solution 4DC1, discrimination parameter estimates are obtained for the items in each 
of the three TOEFL test sections on three distinct secondary dimensions. Solutions 3DC2, 3DC3, and 4DC2
were hypothesized to allow reading comprehension items to measure different abilities from vocabulary items. 
For example, in Solution 3DC2, reading comprehension items were hypothesized to measure the same ability 
dimension as listening comprehension items, whereas the vocabulary items were hypothesized to measure the 
same ability dimension as the structure and written expression items. In Solution 3DC3, reading comprehension 
was hypothesized to be measured only by the general factor. In Solution 4DC2, reading comprehension items 
were hypothesized to measure a distinct secondary ability dimension. To evaluate the various CMIRT structures, 
the different CMIRT solutions were compared, in terms of goodness-of-fit, to each other and to the MIRT solutions.
Residual analyses were also performed. 

RESULTS 

Sampling 

The number-correct score means and standard deviations, along with sample sizes and KR-20 reliability 
estimates, for the four sets of data sampled for this study are shown in Table 2. Although the four sets of data 
were similar with regard to these statistics, there are some important differences. Note, for example, that the 
number-correct score means are somewhat higher for the samples of domestic examinees and the standard 
deviations are somewhat lower. Despite the nonrandom sampling procedures, these results are consistent with 
results typically seen on the operational TOEFL test. 



TABLE 2  Number-Correct Score Summary Statistics by Form and Sample

                                  Statistic
Form/Sample          N        Mean     S.D.     KR-20

November 1987
  Foreign (JM)       2,538    90.2     32.1     0.98
  Domestic (JP)      2,551    92.2     30.9     0.97
May 1988
  Foreign (K9)       2,537    87.7     33.6     0.98
  Domestic (KA)      2,509    90.3     31.7     0.98



It can also be seen from Table 2 that the scores were somewhat higher and less variable for the November 1987 
samples (hereinafter referred to as JM for the foreign examinee sample and JP for the domestic sample) than 
for the May 1988 samples (K9 for the foreign sample and KA for the domestic sample). These differences are 
likely a reflection of differences in average item difficulty between the two test forms, rather than an indication
of differences in group ability. 

Model Fit: Exploratory Analyses

Model Selection Criteria 

Table 3 summarizes the model selection criteria for the exploratory MIRT analyses performed on the four sets 
of data. For each test form and examinee sample, Table 3 shows the likelihood ratio chi-square, AIC, and CAIC 
statistics for the unconstrained one-dimensional (1D), two-dimensional (2D), three-dimensional (3D), and
four-dimensional (4D) solutions.

As the data in Table 3 indicate, the relative ordering of solutions was consistent across samples for the chi-square and
the AIC selection criteria: for all data sets, the 4D solution would be considered optimal, followed, in order, by
the 3D, 2D, and 1D solutions. In each sample, the differences between chi-square statistics were testable and
were found to be statistically significant. Across solutions, the decreases in the chi-square and AIC statistics from
the 1D to the 2D solutions were largest, and the decreases in these statistics from the 3D to the 4D solutions
were relatively small. 

Using the CAIC statistics, however, results in different orderings of the exploratory solutions. For each data
sample, the CAIC statistics indicate that the 3D solutions are optimal. For two of the samples (JM and KA), 
the CAIC statistic is lower for the 2D solution than it is for the 4D solution. The differences between the CAIC 
results and those obtained for the chi-square and AIC statistics are clearly due to the penalty for 
overparameterization that is incorporated into the CAIC statistic (see Equation 8). McKinley (1989) points out 
that both the CAIC and AIC statistics essentially embody a '^critical value" for testing whether a particular model 
is the best fitting one. Selecting the CAIC statistic over the AIC statistic is equivalent to selecting a larger 
critical value, which reduces the Type I error rate. In fact, the main advantage of the CAIC statistic is that the 
Type I error rate decreases exponentially with increased sample size. Asymptotically, Type I error for the CAIC 
statistic goes to zero (Bozdogan, 1987). 
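
The effect of the heavier CAIC penalty can be illustrated with a short sketch. It assumes the standard forms AIC = -2 ln L + 2k and CAIC = -2 ln L + k(ln n + 1) (Bozdogan, 1987), so that CAIC = AIC + k(ln n - 1). The parameter count k = 146(d + 1) for a d-dimensional solution and the sample size n = 2,538 for sample JM are assumptions; with them the sketch reproduces the tabled CAIC values to within rounding. Names are hypothetical.

```python
import math

def caic_from_aic(aic, n_params, n_examinees):
    """CAIC = AIC + k (ln n - 1), which follows from
    AIC = -2 ln L + 2k and CAIC = -2 ln L + k (ln n + 1)."""
    return aic + n_params * (math.log(n_examinees) - 1.0)

ITEMS = 146          # assumed test length (one b and d a-parameters per item)
N_EXAMINEES = 2538   # assumed sample size for sample JM
JM_AIC = {1: 374618.8, 2: 368114.8, 3: 366671.7, 4: 366355.6}  # from Table 3

jm_caic = {
    d: caic_from_aic(aic, ITEMS * (d + 1), N_EXAMINEES)
    for d, aic in JM_AIC.items()
}

# The preferred solution under each criterion is the one with the lowest value.
best_by_aic = min(JM_AIC, key=JM_AIC.get)
best_by_caic = min(jm_caic, key=jm_caic.get)

for d in sorted(JM_AIC):
    print(f"{d}D: AIC = {JM_AIC[d]:.1f}, CAIC = {jm_caic[d]:.1f}")
print(f"Lowest AIC: {best_by_aic}D; lowest CAIC: {best_by_caic}D")
```

Because the per-parameter penalty grows from 2 to ln n + 1 (about 8.8 here), the small gain from the fourth dimension no longer offsets the 146 additional parameters.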
TABLE 3 Exploratory Analysis Model Selection Criteria 

            Number of                    Criterion
Sample      Dimensions     Chi-Square       AIC           CAIC

JM             1D           334251.6      374618.8      376615.8
               2D           327455.7      368114.8      371110.4
               3D           325720.6      366671.7      370665.8
               4D           325112.5      366355.6      371348.2

JP             1D           332826.3      373415.0      375413.5
               2D           328494.5      369375.2      372373.0
               3D           326372.6      367545.3      371542.3
               4D           325445.9      366910.5      371906.8

K9             1D           331088.4      371407.0      373403.9
               2D           324155.5      364766.1      367761.5
               3D           321895.4      362798.0      366791.8
               4D           320813.6      362008.2      367000.5

KA             1D           329103.0      368949.4      370943.1
               2D           325507.5      365645.9      368636.4
               3D           323915.3      364345.8      368333.1
               4D           323093.3      363815.8      368800.0

Analysis of Residuals 

Table 4 provides a summary of the analysis of residuals performed on each MIRT solution. For each examinee 
sample for each test form, Table 4 provides the first three eigenvalues and the percentage of variance accounted 
for by each from a principal components analysis of Pearson correlations computed on residuals. Also shown 
are the ratios of the first to second and second to third eigenvalues. 
TABLE 4 Summary of the Principal Components Analysis of 
Residuals for the Exploratory Analyses 

Sample/
Number of          Eigenvalue (% of Variance)             Ratios
Dimensions       E1           E2           E3           E1/E2   E2/E3

JM
  1D          5.03(3.5)    2.31(1.6)    2.00(1.4)       2.18    1.16
  2D          2.35(1.6)    2.06(1.4)    1.83(1.3)       1.14    1.13
  3D          2.13(1.5)    1.86(1.3)    1.78(1.2)       1.15    1.04
  4D          1.97(1.3)    1.85(1.3)    1.73(1.2)       1.07    1.07

JP
  1D          3.54(2.4)    2.94(2.0)    2.06(1.4)       1.21    1.43
  2D          2.96(2.0)    2.08(1.4)    1.77(1.2)       1.42    1.18
  3D          2.11(1.4)    1.79(1.2)    1.71(1.2)       1.18    1.05
  4D          1.80(1.2)    1.74(1.2)    1.66(1.1)       1.03    1.05

K9
  1D          4.96(3.4)    2.88(2.0)    2.39(1.6)       1.72    1.21
  2D          2.91(2.0)    2.42(1.7)    2.03(1.4)       1.20    1.19
  3D          2.40(1.6)    2.06(1.4)    1.87(1.3)       1.17    1.10
  4D          2.03(1.4)    1.89(1.3)    1.79(1.2)       1.07    1.05

KA
  1D          3.18(2.2)    2.51(1.7)    2.18(1.5)       1.26    1.15
  2D          2.56(1.8)    2.22(1.5)    1.79(1.2)       1.15    1.24
  3D          2.24(1.5)    1.79(1.2)    1.68(1.2)       1.25    1.07
  4D          1.81(1.2)    1.72(1.2)    1.65(1.1)       1.05    1.04


For the 4D solution, the results reported in Table 4 indicate essentially no meaningful variation remaining in the 
residuals. For all data sets, increasing the number of parameters estimated reduced the magnitudes of all three 
eigenvalues, with the most pronounced reduction occurring when two discrimination parameters were estimated 
instead of one. However, other than the changes apparent in going from the 1D to 2D solutions, there are no 
consistent trends in these data across samples. For the 1D data, the magnitudes of the first eigenvalues were 
appreciably greater in the foreign data samples (forms JM and K9) than in the domestic data samples. This 
suggests that departures from unidimensionality were perhaps more severe for the foreign samples than for the 
domestic samples. 
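
The eigenvalue ratios reported in Table 4 are simple quotients of successive eigenvalues, and the arithmetic can be checked with a short sketch. The values below are the rounded eigenvalues for sample JM as printed in the table, so recomputed ratios may differ from the tabled ones in the last digit. Reading a ratio near 1 as a sign that no single residual component stands out from the next is offered here as an interpretive aid, not as a rule stated by the authors. Names are hypothetical.

```python
# First three eigenvalues of the residual correlations for sample JM (Table 4).
JM_EIGENVALUES = {
    "1D": (5.03, 2.31, 2.00),
    "2D": (2.35, 2.06, 1.83),
    "3D": (2.13, 1.86, 1.78),
    "4D": (1.97, 1.85, 1.73),
}

def eigenvalue_ratios(eigenvalues):
    """Return (E1/E2, E2/E3) for the first three eigenvalues."""
    e1, e2, e3 = eigenvalues
    return e1 / e2, e2 / e3

ratios = {label: eigenvalue_ratios(e) for label, e in JM_EIGENVALUES.items()}

for label, (r12, r23) in ratios.items():
    print(f"{label}: E1/E2 = {r12:.2f}, E2/E3 = {r23:.2f}")
```

For sample JM the first ratio drops from about 2.2 in the 1D solution to about 1.1 once a second dimension is fitted, which matches the observation that the largest change occurs between the 1D and 2D solutions.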

Model Fit-Confirmatory Analyses 

Model Selection Criteria 

Table 5 summarizes the model selection criteria for the confirmatory MIRT analyses. For each test form and 
examinee sample, Table 5 shows the likelihood ratio chi-square, AIC, and CAIC statistics for each of the 
hypothesized test structures. 
TABLE 5 Confirmatory Analysis Model Selection Criteria 

                                         Criterion
Sample      Solution       Chi-Square       AIC           CAIC

JM            2DC1          329812.3      370279.5      372618.4
              3DC1          327340.4      367999.5      370995.1
              3DC2          328002.3      368661.4      371656.9
              3DC3          328383.9      368985.1      371782.3
              4DC1          328963.5      369622.6      372618.1
              4DC2          328463.5      369122.6      372118.1

JP            2DC1          331262.0      371950.7      374291.4
              3DC1          329167.7      370048.4      373046.1
              3DC2          329787.2      370667.9      373665.7
              3DC3          330721.3      371543.9      374343.2
              4DC1          330754.0      371634.6      374632.4
              4DC2          330739.1      371619.8      374617.6

K9            2DC1          327098.7      367517.3      369856.1
              3DC1          323674.9      364285.5      367280.8
              3DC2          324199.2      364809.8      367805.2
              3DC3          (values illegible in source)
              4DC1          325520.4      366131.0      369126.4
              4DC2          324000.2      364610.8      367606.2

KA            2DC1          327987.3      367933.8      370268.8
              3DC1          325888.2      366026.7      369017.2
              3DC2          326501.9      366640.4      369630.9
              3DC3          326989.9      367070.3      369867.8
              4DC1          327196.4      367334.9      370325.4
              4DC2          327001.7      367140.2      370130.7


Unlike the exploratory analyses, where the optimal solution differed according to the different model selection 
criteria, in the confirmatory analyses the chi-square, AIC, and CAIC criteria all indicated that Solution 3DC1 
was optimal for each sample. In fact, the chi-square and AIC statistics resulted in the same ordering of solutions 
for each sample, and only for Sample JP was the ordering obtained using the CAIC statistic different from those 
obtained using the chi-square and AIC statistics. 

For the most part, the model selection criteria indicated that the 3D confirmatory solutions were preferable to 
the 4D confirmatory solutions. This was somewhat surprising, as Solution 4DC1 was most consistent with the 
current configuration of the TOEFL test sections. Solution 3DC1 suggests that the TOEFL test is characterized 
by a general dimension, a dimension associated with listening comprehension, and a dimension associated with 
the other sections of the test. This solution is consistent with interpretations of the TOEFL test structure 
suggested by Hale et al. (1988) and Hale, Rock, and Jirele (1989). 

Comparing the results in Table 5 to those in Table 3, it can be seen that for the foreign samples (JM and K9), 
the CAIC statistic for Solution 3DC1 was lower than all the CAIC statistics except those for the 3D exploratory 
solutions. However, for the domestic samples (JP and KA), the CAIC statistics for the 2D, 3D, and 4D 
exploratory solutions were all lower than the CAIC statistics for any of the confirmatory solutions. A possible 
explanation for this result is that there is probably a more salient distinction between the listening comprehension 
section of the TOEFL test and the other sections for foreign examinees than for domestic examinees. Foreign 
examinees typically have more trouble with the listening comprehension section of the test than they do with the 
other sections. Domestic examinees tend to perform better on listening comprehension relative to the other 
sections on the test because they have had more opportunities to listen to spoken English in natural settings. 

Analysis of Residuals 

Table 6 provides a summary of the analysis of residuals performed on each CMIRT solution. For each examinee 
sample for each test form, Table 6 provides the first three eigenvalues and the percentage of variance accounted 
for by each from a principal components analysis of Pearson correlations computed on residuals. Also shown 
are the ratios of the first to second and second to third eigenvalues. 

The data in Table 6 indicate little variation in the magnitudes of the eigenvalues across the 3D and 4D 
confirmatory solutions. In general, at least one of the three eigenvalues for each of the 2D solutions tends to 
be higher than the corresponding eigenvalues for the 3D and 4D solutions. 
TABLE 6 Summary of the Principal Components Analysis of 
Residuals for the Confirmatory Analyses 

Sample/
Number of          Eigenvalue (% of Variance)             Ratios
Dimensions       E1           E2           E3           E1/E2   E2/E3

JM
  2DC1        (values illegible in source)
  3DC1        (values illegible in source)
  3DC2        (values illegible in source)
  3DC3        (values illegible in source)
  4DC1        (values illegible in source)
  4DC2        2.10(1.4)    1.92(1.3)    1.81(1.2)       1.10    1.06

JP
  2DC1        (values illegible in source)
  3DC1        (values illegible in source)
  3DC2        (values illegible in source)
  3DC3        (values illegible in source)
  4DC1        (values illegible in source)
  4DC2        (values illegible in source)

K9
  2DC1        3.11(2.1)    2.41(1.7)    2.07(1.4)       1.29    1.16
  3DC1        2.42(1.7)    2.17(1.5)    2.08(1.4)       1.12    1.04
  3DC2        2.62(1.8)    2.19(1.5)    2.08(1.4)       1.20    1.05
  3DC3        2.58(1.8)    2.13(1.5)    2.08(1.4)       1.21    1.03
  4DC1        2.34(1.6)    2.19(1.5)    1.85(1.3)       1.07    1.18
  4DC2        2.41(1.7)    2.09(1.4)    1.84(1.3)       1.15    1.13

KA
  2DC1        2.55(1.8)    2.23(1.5)    1.79(1.2)       1.15    1.25
  3DC1        2.39(1.6)    2.03(1.4)    1.79(1.2)       1.18    1.13
  3DC2        2.57(1.8)    1.80(1.2)    1.77(1.2)       1.43    1.02
  3DC3        2.53(1.7)    1.83(1.3)    1.79(1.2)       1.39    1.02
  4DC1        2.49(1.7)    1.99(1.4)    1.75(1.2)       1.25    1.14
  4DC2        2.40(1.6)    1.81(1.2)    1.78(1.2)       1.32    1.02


Summary of Model Fit 

For both the exploratory and confirmatory analyses carried out in this study, the model selection criteria 
suggested that the 4D interpretations of the TOEFL test were not optimal. The CAIC statistics for the 
exploratory 4D solutions were higher than the CAIC statistics for 3D solutions and, in some samples, were higher 
than the 2D CAIC statistics. The principal components analyses of residuals were less conclusive, but they did 
not indicate that the 4D solutions were clearly preferable to the 2D and 3D solutions. In the confirmatory 
analyses, all of the model selection criteria indicated that Solution 3DC1 was preferable to both the 4DC1 and 
4DC2 solutions. 

Interpretability of Results 

As in the case of factor analysis, exploratory MIRT solutions are subject to rotational indeterminacy. 
Therefore, extreme caution should be used in examining the unrotated multidimensional item and ability 
parameter estimates for the exploratory solutions. Further, comparisons between solutions for the foreign and 
domestic samples for the same test forms (i.e., comparisons between solutions for Forms JM and JP, and 
between solutions for Forms K9 and KA) are problematic without equating, in the case of both the exploratory 
and confirmatory solutions. Because of time and budget constraints, no rotations or equatings of solutions were 
attempted in the context of this study. To provide some interpretation, however, graphical representations of 
the exploratory 3D solutions were inspected to determine if patterns in the multidimensional a-parameter 
estimates were related to content areas of the test. In addition, means and standard deviations of item parameter 
estimates for the confirmatory solution 3DC1 were calculated by TOEFL test section. 
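
Rotational indeterminacy can be illustrated with a small sketch. It assumes a compensatory logistic MIRT model in which the response probability depends on the item and the examinee only through the sum a1*theta1 + a2*theta2 + b; that functional form, the parameter values, and all names below are illustrative assumptions, not the authors' estimates. Rotating the discrimination vector and the ability vector by the same angle leaves the inner product, and therefore the fitted probability, unchanged, which is why unrotated axes carry no fixed meaning.

```python
import math

def rotate(vec, angle):
    """Rotate a two-dimensional vector counterclockwise by angle (radians)."""
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def probability(a, theta, b):
    """Assumed compensatory logistic model: P = 1 / (1 + exp(-(a . theta + b)))."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + b
    return 1.0 / (1.0 + math.exp(-z))

a = (1.2, 0.4)        # hypothetical item discriminations
theta = (0.5, -1.0)   # hypothetical examinee abilities
b = 0.3               # hypothetical intercept
angle = math.radians(35)

a_rotated = rotate(a, angle)
theta_rotated = rotate(theta, angle)

p_original = probability(a, theta, b)
p_rotated = probability(a_rotated, theta_rotated, b)

print(f"original a = {a}, rotated a = {a_rotated}")
print(f"P(original) = {p_original:.6f}, P(rotated) = {p_rotated:.6f}")
```

The two parameter sets look different but fit the data equally well, so any apparent swap of roles between dimensions across samples may reflect orientation and not substance.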

Exploratory Results 

Bivariate plots of the 3D item discrimination estimates for each form and sample are provided in Figures 1 
through 4. Each figure has three plots: the a2-estimates graphed against the a1-estimates, the a3-estimates 
graphed against the a1-estimates, and the a3-estimates graphed against the a2-estimates. In each plot, the items 
in Section 1 (S1) are represented by squares, the items in Section 2 (S2) by plusses, the vocabulary items in 
Section 3 (S3-Voc) by diamonds, and the reading comprehension items in Section 3 (S3-RC) by triangles. 

In Figure 1, the plot of the a2-estimates against the a1-estimates (top graph) indicates a separate clustering of 
the Section 1 items from the items in Sections 2 and 3. A similar clustering can be seen in the plot of the 
a3-estimates against the a2-estimates (bottom graph), suggesting that the second ability dimension in this solution 
primarily measures listening. The plots involving the a3-estimates (particularly the middle graph) indicate that 
the reading comprehension items of Section 3 tend to have the highest discrimination values on this dimension. 

In Figure 2, the plot of the a2-estimates against the a1-estimates (top graph) for Form JP is very similar to the 
plot seen in Figure 1 for Form JM. Again, the Section 1 items cluster distinctly from the Section 2 and 3 items. 
However, contrary to the plots in Figure 1, there is no content-related pattern apparent in the 
a3-estimates (middle and bottom graphs). 

In Figure 3, the observed patterns are similar to those seen in Figure 1, except that the roles of the a2- and 
a3-estimates are reversed. That is, the a3-estimates (rather than the a2-estimates as in Figure 1) are clearly higher 
for the Section 1 items compared to the Section 2 and 3 items (middle and bottom graphs). Similarly, the 
highest a2-estimates (rather than a3-estimates as in Figure 1) are seen almost exclusively for reading 
comprehension items (top graph). These differences between Figures 1 and 3 are insignificant, as the dimensions 
defined by the a2- and a3-estimates are arbitrary because of the rotational indeterminacy of the MIRT solutions. 

In Figure 4, the clustering in the plot of the a2-estimates against the a1-estimates again appears to separate the 
Section 1 items from the Section 2 and 3 items. In this case, the a2-estimates are clearly higher for the Section 
2 and 3 items than for the Section 1 items. As was the case in Figure 2, the plots of the a3-estimates against the 
a1- and a2-estimates did not suggest any meaningful content-related pattern. 

In summary, Figures 1 through 4 did provide support for considering the listening comprehension section as 
measuring a latent ability distinct from the ability measured by the remaining sections of the TOEFL test. In 
addition, Figures 1 through 4 suggested that the pattern of the multidimensional item parameter estimates for 
the reading comprehension items in Section 3 differed in the foreign and domestic samples. The reasons for 
this are not clear, and may warrant further investigation, particularly as the TOEFL program is currently 
researching the possibility of revising Section 3 to eliminate discrete vocabulary items. 

Figure 1: Plots of 3-Dimensional A-Values - Form JM 

Figure 2: Plots of 3-Dimensional A-Values - Form JP 

Confirmatory Results - 3DC1 



Table 7 presents the means and standard deviations of the item parameter estimates for the CMIRT solution 
3DC1 obtained in each sample. These summary statistics are displayed for the relevant a1-, a2-, a3-, and b-estimates 
from Section 1, Section 2, Section 3 vocabulary, and Section 3 reading comprehension. Although the 
confirmatory solutions are not subject to rotational indeterminacy, it should be noted that the latent scales for 
these solutions have not been equated and should be compared with caution. 

TABLE 7 Means and Standard Deviations of Item Parameter Estimates 
by Content Area for Confirmatory Solution 3DC1 

                                         Examinee Sample
Content/                    JM              JP              K9              KA
Parameter       N      Mean   S.D.     Mean   S.D.     Mean   S.D.     Mean   S.D.

S1 a1          50      0.86   0.31     0.79   0.27     0.77   0.29     0.83   0.38
S2 a1          38      0.97   0.37     0.85   0.30     1.03   0.26     0.83   0.21
S3-VOC a1      29      0.87   0.27     0.76   0.23     0.94   0.29     0.94   0.30
S3-RC a1       29      0.98   0.32     0.90   0.28     1.00   0.34     0.85   0.27

S1 a2          50      0.54   0.18     0.51   0.19     0.52   0.22     0.47   0.23
S2 a3          38      0.53   0.21     0.38   0.22     0.67   0.19     0.51   0.16
S3-VOC a3      29      0.54   0.20     0.48   0.34     0.64   0.24     0.43   0.16
S3-RC a3       29      0.34   0.17     0.42   0.20     0.27   0.16     0.38   0.16

S1 b           50     -0.02   0.75     0.27   0.80    -0.26   0.75    -0.13   0.86
S2 b           38      0.38   0.69     0.21   0.75     0.68   0.76     0.30   0.82
S3-VOC b       29     -0.02   1.14    -0.02   1.05     0.37   1.01    -0.08   1.14
S3-RC b        29     -0.02   0.87     0.24   0.89    -0.14   0.75     0.10   0.77


It is worth noting that the average reading comprehension a3-estimates for Forms JM and K9 in Table 7 are 
much lower than the average a3-estimates for Section 2 or vocabulary items. In addition, the average difficulties 
of the listening items compared to the other items tend to follow different patterns in the foreign and domestic 
samples. (Contrary to more common IRT estimation programs, in the CONFIRM program, higher b-estimates 
correspond to easier items and lower b-estimates correspond to more difficult items.) 
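
The sign convention in the parenthetical note can be illustrated with a brief sketch. It assumes that the b-estimates enter a logistic model as an additive intercept, P = 1 / (1 + exp(-(a . theta + b))), which is one parameterization under which a higher b means an easier item. Whether this matches the program's exact model is an assumption. For contrast, the sketch also shows the familiar unidimensional form a(theta - b), where a higher b means a harder item. The parameter values are illustrative and all names are hypothetical.

```python
import math

def p_intercept_form(a, theta, b):
    """Assumed slope-intercept form: b is added to the weighted ability sum."""
    z = sum(ai * ti for ai, ti in zip(a, theta)) + b
    return 1.0 / (1.0 + math.exp(-z))

def p_difficulty_form(a, theta, b):
    """Common unidimensional form: b is subtracted from ability."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

a = (0.86, 0.54, 0.0)      # illustrative discriminations on three dimensions
theta = (0.0, 0.0, 0.0)    # an examinee at the origin of the latent space

# Intercept form: the item with the higher b is answered correctly more often.
easier = p_intercept_form(a, theta, 0.38)
harder = p_intercept_form(a, theta, -0.02)

# Difficulty form: the same two b values rank the items the other way.
conventional_high_b = p_difficulty_form(1.0, 0.0, 0.38)
conventional_low_b = p_difficulty_form(1.0, 0.0, -0.02)

print(f"intercept form:  b = 0.38 -> {easier:.3f}, b = -0.02 -> {harder:.3f}")
print(f"difficulty form: b = 0.38 -> {conventional_high_b:.3f}, b = -0.02 -> {conventional_low_b:.3f}")
```

Under the assumed intercept form, a section with a higher mean b in Table 7 would be read as easier on average for that sample.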

DISCUSSION 

The results of the exploratory analyses supported the interpretation that the TOEFL test is characterized by 
essentially three latent ability dimensions. For each sample analyzed, the CAIC statistics were lowest for the 
3D solutions, and for two of the samples (JM and KA), the CAIC statistic was lower for the 2D solution than 
for the 4D solution. Although the CAIC statistics suggested that a three-dimensional structure of the TOEFL 
test was optimal, the authors were unable to find a consistently meaningful content-related interpretation for all 
three dimensions. Inspection of plots of the item discrimination estimates for the exploratory 3D solutions 
suggested that the listening comprehension section of the TOEFL examination measures a different ability 
dimension than the nonlistening sections of the test. In addition, for the foreign samples, there was some 
evidence from the item discrimination estimates that reading comprehension items could be differentiated from 
Section 2 and the vocabulary items of Section 3. 
The best fitting model examined in the confirmatory analyses was Solution 3DC1, which suggests the existence 
of a general ability dimension, a secondary ability dimension measuring listening comprehension, and a secondary 
ability dimension measuring a combination of structure and written expression and vocabulary and reading 
comprehension. An alternate model (4DC1), which allowed for a general ability dimension and secondary 
dimensions associated with each section of the test, proved clearly to be a less satisfactory solution, as did the 
4DC2 model, which differs from the 4DC1 model in that the vocabulary portion of the reading comprehension 
section is associated with the secondary ability dimension corresponding to the structure and written expression 
section. 

The results with respect to solution 3DC1 tend to confirm the interpretation of the TOEFL offered by Hale et 
al. (1988) and Hale, Rock, and Jirele (1989), although these authors did not extract a general ability dimension 
in their analyses. In the present study, the CMIRT algorithm employed required the extraction of a general 
ability dimension. Thus, it should be noted that other plausible confirmatory structures exist that may provide 
better fit to the data investigated in this study. For example, it is possible that the structure of the TOEFL test 
could be better supported by a solution with three correlated dimensions (one for each section). In future 
investigations, it might be worthwhile to fit CMIRT structures to TOEFL data that do not require every item 
to load on a general factor. 

Although the models identified as best fitting in this study were consistent across test forms and examinee 
samples, there did appear to be tentative evidence of differences in the salience of different dimensions, 
depending upon whether the fitted data were based on foreign or domestic samples. A reasonable explanation 
for this finding is that the makeup of foreign and domestic examinees taking the TOEFL tends to differ in terms 
of native language and proficiency in various aspects of the English language. Several studies with TOEFL test 
data have suggested that item performance, test equating, and the structural interpretation of the test may differ 
according to examinees' native language and/or English proficiency (Alderman & Holland, 1981; Golub-Smith, 
1986; Oltman, Stricker, & Barrows, 1988). Furthermore, in operational administrations of the TOEFL test there 
is consistent evidence of differences in the way foreign and domestic examinees perform on the listening 
comprehension section, compared to the other sections of the test. These differences appear to be related to 
differential experience in informal and formal exposure to English. Informal exposure would tend to emphasize 
communicative aspects, such as speaking and listening, while formal exposure would tend to emphasize grammar, 
vocabulary, and reading. 



The results of this study indicated that the MIRT and CMIRT procedures were quite successful in modeling 
secondary ability dimensions on the TOEFL. The two procedures provided corroborative evidence in interpreting 
the structure of the test. This evidence was consistent with previous interpretations of the test's structure, and 
was verified by examining the characteristics of item parameter estimates in the various solutions obtained in the 
study. The data presented in this study also illustrated how the consistent Akaike information criterion can be 
utilized to identify the best of several competing models of test structure. 

Several areas of research related to application of the MIRT and CMIRT models could not be investigated in 
this study but may be of interest for future investigations. For example, in the present study, no attempt was 
made to rotate different solutions (for example, the foreign and domestic samples) to a common orientation, or 
to equate the multidimensional estimates for different solutions to a common scale. As in the case of 
unidimensional IRT, parameter estimates of the same items obtained in separate calibrations are not directly 
comparable until they have been transformed to a common scale. Although some progress has been made in 
this area (cf. Ackerman, 1990; Hirsch, 1989; Reckase, Davey, & Ackerman, 1989), research must continue to 
address this problem if practical use of MIRT parameter estimates is to be made in applications such as equating 
and test development. Another possibility for investigation in future studies could be based on comparisons of 
ability estimates obtained in unidimensional and MIRT solutions that have been transformed to an estimated 
true score scale. Such transformations might provide information about how much practical impact fitting 
additional ability dimensions would have on examinees' scores. Finally, future applications could attempt to use 
MIRT and CMIRT procedures to provide diagnostic feedback to test assemblers for the purpose of modifying 
the TOEFL test design. Such feedback could be used to enhance the measurement of important secondary 
abilities, or to eliminate them if they were undesired. 